ABSTRACT



DIAGNOSTIC TESTING: FIELD TRIAL RESULTS 
David L. McArthur and Beverly Cabello 
UCLA Center for the Study of Evaluation 



A diagnostic testing system managed by microcomputer was evaluated in
actual use at the upper primary level. Two tests specifically designed to
yield diagnostic indicators of erroneous performance were used: one a
test of pronoun usage, the other a test of reading comprehension. The
results are interpreted from the standpoint of the examinees, of the tests,
and of the computer software. Lessons about the viability of test
management software in real-time use, about the patterning of erroneous
responses, and about the efficiency of diagnostic testing by computer are
discussed.



INTRODUCTION 



There are few problems as widespread among primary school pupils as the 
inability to comprehend written materials. However, in the realm of 
diagnostic testing of reading comprehension, progress has been limited. 
Generally a student is asked to take a test from beginning to end, and only 
upon completion does diagnostically useful information about that student's 
abilities begin to emerge. Frequently the interpretation is based solely 
on percentage scores derived by relatively elementary algorithms concerning 
tallies of right and wrong responding. 

If the student's encounter with the test could be guided by an "actively
involved onlooker," the test itself could proceed selectively, and the
student could be encouraged to tackle questions which are increasingly
"well-suited," in the sense that they yield increasingly close estimates of
the student's optimal level of functioning. Each estimate would be
supported by one or more diagnostic hypotheses, with suitable confidence
levels. Ideally, such an adaptive sequencing of tasks would explore
hypotheses about the student's abilities based on early responses, and
guide the student toward those tasks which have high probative value, that
is, a high likelihood of information return. Ad seriatim tests are seldom
able to accomplish this. The "active onlooker," however, can take the form
of a test specialist working one-on-one, or of a computer algorithm
suitable for real-time shaping of the testing task, working either by rule
or probabilistically. To do this requires both suitable heuristics on
which to build adequate diagnostic inferences and a systematic way of
moving rationally between test tasks.

A rule-based system, DX, functioning as a dedicated test
administration/feedback device, was programmed for use in conjunction with
a test of reading specifically designed around diagnostic principles. DX,
a real-time interactive program described below, took the role of "active
onlooker." First, however, we present a review of the important concepts
which drive the construction of a diagnostic test, and the results of two
pilot studies conducted to evaluate and refine suitable test instruments.

Diagnostic test construction

The underlying goal in the development of the diagnostic tests was to
create a profile of scores for each individual as well as for groups of
students, which teachers can use to diagnose specific areas of difficulty.
This is accomplished by 1) designing a test which rigorously measures those
factors in a domain which can affect performance; and 2) determining the
level and consistency of students' performance across items and item
clusters which measure those factors.

A domain-referenced approach was used to design test items for two
separate diagnostic tests, the first in pronouns, a very well-bounded area
in language skills, the second in more general abilities of reading
comprehension. This approach, which assumes that the main goal of testing
is to assess an individual's status with respect to a skill, requires a
thorough understanding and specification of the domain to be addressed.
The resulting domain specification provides a blueprint for developing test
items, by way of a conceptual map of the skill to be assessed. The
blueprint was drafted by reference to extant curricular material, subject
area specialists, and research on the structure of the knowledge base and
the nature of learning. In addition, we identified important factors
within the domain that might cause an item to be more or less difficult or
a student's performance to vary. Items representing these factors were
used to produce a test with diagnostic utility, one which identifies the
reasons for the student's performance level.

PILOT STUDY: PRONOUN TEST

Within the framework of domain-referenced testing, a diagnostic test
was developed to examine pronoun use by fourth- through sixth-graders.
This test was used to establish evidence for manipulating a number of
variables within the test structure. Variables included four factors
representing content structure and one representing cognitive complexity.
At least in theory, each pronoun type could be classified by form, number,
and person. There are two types of form: relative form (who or whom) and
non-relative form. Number pertains to singular (she) and plural (they).
Person can be of three types: first (I, we), second (you), and third (he,
she, they). Since items measuring the second person would have sounded
contrived to the reader, the test included only the first and third
persons.

Two levels of cognitive complexity were used in this test,
corresponding to whether students had to use the context of a reading
passage to determine the correct pronoun. In the first level, the pronoun
referent was given and the student needed only to associate that referent
with the correct pronoun. In the second, more complex level, students were
presented with a short paragraph that included a blank in the place of one
noun; students needed to use the context of the paragraph to identify the
referent that was appropriate to the blank and then select the correct
pronoun for that referent. The correct pronoun could be determined only
from elements of the paragraph in which the pronoun was embedded.
Consequently, the test used two levels of embeddedness corresponding to two
levels of cognitive complexity.

The ideal Pronoun test would have items for every combination of the
five factors. Since the form, embeddedness, person, and number factors
each had two levels and the rule factor had five levels, a complete test
would have 80 (2x2x2x2x5) combinations. However, for several combinations
of factors, sensible items could not be written. First, non-embedded items
could not be written to elicit singular first-person pronouns (I, me, or
my). Second, items testing the relative form of first-person pronouns
would have been contrived. Third, English does not use any relative form
of possessive pronouns.
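
The count of 80 cells can be checked by enumerating the factor levels. The
following is a minimal sketch only: the five rule labels are placeholders,
since the rules are not all named in the text, and the filter covers just
the first two exclusions described above.

```python
from itertools import product

# Factor levels as described in the text. The five rule labels are
# placeholders (an assumption); the text does not name all five rules.
FACTORS = {
    "form": ["relative", "non-relative"],
    "embeddedness": ["embedded", "non-embedded"],
    "person": ["first", "third"],
    "number": ["singular", "plural"],
    "rule": ["rule_1", "rule_2", "rule_3", "rule_4", "rule_5"],
}


def all_combinations(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


def is_writable(cell):
    """Illustrative filter for the first two exclusions named in the text."""
    # Non-embedded items could not elicit singular first-person pronouns.
    if (cell["embeddedness"] == "non-embedded"
            and cell["person"] == "first"
            and cell["number"] == "singular"):
        return False
    # Relative-form first-person items would have been contrived.
    if cell["form"] == "relative" and cell["person"] == "first":
        return False
    return True


cells = all_combinations(FACTORS)
print(len(cells))  # 80, matching 2x2x2x2x5
```

Filtering `cells` with `is_writable` shows how a design grid of this kind
shrinks once unwritable cells are dropped; it is not a reconstruction of the
actual item pool.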

The test used a multiple-choice format with five alternatives per
item, consisting of the correct response, three distractors which were
correct in all ways but one, and a fourth distractor which was correct only
in one way or not at all. An example is the item, "Mom praised Mary and
Stevie", with the following alternatives: them, they, us, him and she. The
correct response (them) is an objective, plural, third-person pronoun. The
next three responses (they, us, and him) were each correct on two of the
three factors (rule, number, person). The final response (she) was correct
only in person. The last response was considered a "wild card"
distractor (a highly unlikely selection), included to detect guessing or
carelessness.
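
The diagnostic value of this distractor design lies in the fact that each
wrong choice points to the factor on which it departs from the key. A
minimal sketch for the example item follows; the factor coding of each
alternative is inferred from the description above and is an assumption,
not a copy of the test's scoring key.

```python
# Factor values of the correct response ("them") for the example item.
KEY = {"rule": "objective", "number": "plural", "person": "third"}

# Assumed factor coding for each alternative, inferred from the text.
ALTERNATIVES = {
    "them": {"rule": "objective", "number": "plural", "person": "third"},
    "they": {"rule": "nominative", "number": "plural", "person": "third"},
    "us": {"rule": "objective", "number": "plural", "person": "first"},
    "him": {"rule": "objective", "number": "singular", "person": "third"},
    "she": {"rule": "nominative", "number": "singular", "person": "third"},
}


def mismatched_factors(choice):
    """List the factors on which a chosen alternative differs from the key."""
    coding = ALTERNATIVES[choice]
    return [factor for factor in KEY if coding[factor] != KEY[factor]]


for choice in ALTERNATIVES:
    print(choice, mismatched_factors(choice))
```

A single mismatched factor identifies the likely source of the error, while
two or more mismatches mark the "wild card" choice.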

Test Administration

Sample. Sixth-grade students from three elementary schools within a
local inner-city district were involved in this study. These schools are
located in a low to middle SES area with a highly transient and mixed
population. Approximately 90% of the students were of Hispanic background,
6% were Black, 2% were Asian, and 2% were non-minority Whites. There were
79 students classified as FEP (Fluent English Proficient) and 49 classified
as LEP (Limited English Proficient), based on district reclassification
criteria of language proficiency tests, achievement tests, and teacher
judgments.

Procedure. Two forms of the diagnostic Pronoun test were prepared.
Both contained the same items but the order of the items was inverted.
After pilot administrations and feedback from teachers and students, the
92-item diagnostic test for pronouns was administered by project staff to
128 pupils. Test instructions allowed the administrators to clarify the
meaning of vocabulary in item stems but not in item distractors. Students
were allowed up to 90 minutes to complete the test, although most students
finished the test in under 60 minutes. Classroom teachers were present
during testing.

RESULTS 

Performance on the Pronoun test was analysed using generalizability
theory, a measurement theory designed to assess multiple sources of
variation in a measurement. Generalizability analysis is particularly
suitable to answering whether all students have difficulty with the same
material (for example, all students misunderstand how to use relative
pronouns, or all students have difficulty with the sequence questions for
expository passages). If true, then a single profile for the whole class
may suffice for diagnosing areas of difficulty. If some material is
particularly troublesome to some students and not to others, then profiles
for individual students may be necessary.

Generalizability analysis also indicates if students perform equally
well on a cluster of dimensions (such as all nominative, objective, and
possessive pronouns; or all of the narrative passages). If so, then it would
not be necessary to provide separate scores for each of those dimensions.
On the other hand, if mastery of one dimension (such as nominative pronouns
or passages which are narrative and contain much explicit information) is
much greater than mastery of other dimensions (such as relative pronouns or
expository passages), then it would be necessary to profile separate scores
for each dimension. Additionally, this analytic technique indicates the
number of items that are needed to reliably measure each skill presented in
the profile.

Preliminary analyses examined whether there were distinct population
subgroups in the design. Analyses of variance indicated that the only
population characteristic influencing performance was language background
(FEP vs LEP; F = 30.09, p < .001). The statistical tests for classroom,
school, ethnic background, and age were not significant. Only the
distinction between FEP and LEP was maintained in subsequent analyses.

The greatest sources of variance were due to pronoun form (relative vs
non-relative), context (embedded vs non-embedded), and the pronoun usage
rule (whether the pronoun is a direct or indirect object, for example).

PILOT STUDY: READING COMPREHENSION TEST

An extensive review of the literature on models of reading
comprehension was conducted. Teachers and current texts for reading
comprehension instruction were consulted to determine the extent to which
test skills and variables were considered in the practical context of
instruction. We concluded that the diagnostic assessment of reading should
consider not only the comprehension skills to be assessed, but also the
attributes of the text on which the assessment is based. Thus we selected
the five most comprehensive models of reading; Table 1 shows those reading
comprehension skills and text variables identified as critical by the
models. We reviewed the literature to determine which of these skills and
variables had been well researched and what methodologies had been
employed. We then selected those skills and text variables which had been
most widely and vigorously researched.

Our literature review indicated that students' performance varies
significantly in relation to the particular attributes of the text they
read. Both novice and expert readers find some kinds of text, such as
science passages written in the expository genre, more difficult than other
kinds of text, such as fables. However, most of the studies examined only
one or two features, such as syntactic complexity, degree of implicit
information, or genre. Few examined a complex of features. We suspected
that the degree of text difficulty may depend on the combination of
features rather than on any single feature. Thus the passages used in the
Reading Comprehension test represent all of the combinations of three major
text features: genre (expository vs. narrative); syntactic complexity
(many vs. few subordinate clauses); and the degree to which information is
explicitly or implicitly expressed. The curricular review yielded content
and passages which were adapted for the test.

Student performance is also affected by the cognitive complexity of
the processing of text. This is determined, in part, by the kinds of
instructions, tasks, or questions readers are asked to attend to when
reading text. These fulfill at least three functions. They set the
reading goal for the reader, the amount and nature of information the
reader must process, and how the reader needs to process the text (recall,
synthesize, integrate, analyze, apply, etc.). The curricular and
literature reviews indicated that the following kinds of reading
tasks/questions have either been researched or appear frequently in the
curricula: literal comprehension questions, inference questions, main idea
questions, and sequence questions.

To investigate the impact of each combination of comprehension
skill/text attributes on test performance, items were generated for as many
of the combinations as possible. For each combination, two parallel items
were written. To the extent possible, parallel items included texts which
were of the same genre and contained similar content, syntactic structure,
and degree of implicit and explicit information.

All passages were accompanied by the same categories of questions:
one surface main idea, one underlying main idea, two literal comprehension,
and two inference questions. Sequence questions were written only for the
expository passages in order to avoid a test which was too lengthy and
because the literature indicated that expository passages seem to be more
difficult for expert as well as novice readers.

The narrative passages were based on Aesop's fables both in content
and in their general story grammar. Such narratives were frequently found
in the basal readers and presented a distinct, easily replicated structure.
The expository passages were adaptations of science passages found in basal
readers. These passages described either a process, such as metamorphosis,
or a procedure, such as conducting a simple experiment.

The test used a multiple-choice format with five alternatives per
item, consisting of the correct response, three partially correct
distractors, and a distractor which was not at all correct. The final draft
of the Reading Comprehension test, containing 20 passages and 136 items,
was piloted in paper-and-pencil format.

Test Administration

Sample. The sample consisted of fourth, fifth, and sixth graders of
varied ethnicity, including Blacks, Hispanics, and Asians as well as
non-minority students. Two-thirds were located in middle to high income
areas within Los Angeles; one-third resided in a low to middle income
area. All students were classified as either native speakers of English or
as Fluent English Proficient.

Procedure. Two forms of the test, each containing the same items but
in different sequence, were administered by staff. This was not a timed
test. Most students, however, completed the test within 60 to 90 minutes.
Students were permitted to ask questions regarding the pronunciation of
words, as well as the meaning of vocabulary which was not part of the item
distractors and did not constitute part of an answer.

RESULTS

Analyses examined the means and t tests for each item and for each
combination of factors; internal consistency across items; part-whole
correlations for items within a passage; and analysis of variance.
Results from these analyses were used to determine which items were to be
deleted or rewritten, and to identify levels of difficulty, either by
passage or question type.

First, passages were ranked according to their level of difficulty, as
represented by the means across the cluster of items which accompanied
them. Second, each cluster of items accompanying a passage was examined to
determine whether the items fell into a reasonable range of means (i.e.,
within a ten-point spread) and whether the part-whole correlation of items
within a passage was significant (.60 or above). If an item proved to be
an outlier because of a low correlation or because the item mean was more
than ten points above or below the means of other items in the cluster, the
item was examined for possible flaws such as poor distractors and
rewritten.
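
The screening rule can be sketched as follows. This is an illustration
under stated assumptions: item means are expressed in percentage points,
each item is compared with the average of the other items in its cluster,
and the function and parameter names are hypothetical rather than taken
from the original analysis.

```python
def flag_items(item_means, part_whole, max_gap=10.0, min_r=0.60):
    """Return indices of items in one passage cluster that need review.

    item_means : percent-correct mean for each item in the cluster
    part_whole : part-whole correlation for each item
    max_gap    : largest tolerated distance from the mean of the other items
    min_r      : smallest acceptable part-whole correlation
    """
    flagged = []
    for i, (mean, r) in enumerate(zip(item_means, part_whole)):
        others = [m for j, m in enumerate(item_means) if j != i]
        # A one-item cluster has nothing to be compared against.
        gap = abs(mean - sum(others) / len(others)) if others else 0.0
        if r < min_r or gap > max_gap:
            flagged.append(i)
    return flagged


# Hypothetical cluster: the last item is far easier to miss than the rest.
print(flag_items([70, 72, 68, 50], [0.70, 0.65, 0.80, 0.70]))
```

Flagged items are candidates for inspection and rewriting, not automatic
deletion, which matches the procedure described above.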

These results also indicated whether the items and passages fell into
a hierarchical structure. Analysis of variance was performed to detect
interactions or main effects by the content or cognitive complexity
factors. These analyses yielded a consistent pattern for passage type
difficulty (the content factor) across all samples. There were no
interactions or main effects for the syntactic complexity variable.

The analysis yielded inconsistent and sporadic results regarding the
effects of question type. The only consistency found was that sequence
questions proved to be significantly more difficult than other question
types for science passages. Surprisingly, literal comprehension questions
often proved to be more difficult than inference or main idea questions.

FIELD TRIALS 

Test materials 

Once the Pronoun and Reading Comprehension Tests were constructed and
evaluated, their items were made the basis of shorter tests suitable for
delivery by computer. The computer versions of the Pronoun and Reading
Comprehension Diagnostic Tests were designed to examine three levels of
difficulty. The levels of difficulty were determined by item means and
hierarchies indicated by the pilot test results. For both tests the levels
as determined by item difficulty were as follows:

Level 1 (least difficult): mean range of .75 to .90 

Level 2 (medium difficulty): mean range of .60 to .74 

Level 3 (most difficult): mean range of .45 to .59 
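
As an illustration, the banding can be written as a small lookup. This is
a sketch only: treating the bands as half-open intervals (so that a mean
such as .745 falls in Level 2) is an assumption, since the ranges above are
stated to two decimals, and the function name is hypothetical.

```python
def difficulty_level(item_mean):
    """Map an item mean (proportion correct) to difficulty level 1, 2, or 3.

    Returns None for items outside all three bands, which would not be
    candidates for the computer-delivered tests.
    """
    if 0.75 <= item_mean <= 0.90:
        return 1  # least difficult
    if 0.60 <= item_mean < 0.75:
        return 2  # medium difficulty
    if 0.45 <= item_mean < 0.60:
        return 3  # most difficult
    return None


for mean in (0.82, 0.66, 0.51, 0.95):
    print(mean, difficulty_level(mean))
```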

Constraints of testing time and programming complexity made it necessary to
select only a subset of items from the paper-and-pencil versions of
the pronoun and comprehension tests. Thus only eighteen Pronoun Test items
and six Reading Comprehension Test passages, each with three items, were
used for the computer adaptation. However, the original visual layout
of the paper-and-pencil test was maintained.

The selection of the Reading Comprehension Test passages was based on
the overall mean for passage difficulty, the means per item, and the
part-whole correlations for items within a passage. Both the overall
passage mean and the means of the items within the passage had to fall
within the levels described above. The part-whole correlations were used
to make sure that no outliers were included. We also considered passage
characteristics in this selection.

Software 

A microcomputer-based test delivery system was constructed using
Pascal, a language which allows for very efficient code, access to
nonsequential input files, presentation of screen windows, and other
operational advantages over competing microcomputer languages. The
operating environment was the UCSD p-System; machine response time under
this system is very fast. The flow of testing was constructed according to
the following rules:

1) Initial testing begins at the first item representing a middle
level of difficulty. No fewer than four items at any level are
administered in sequence.

2) Four correct responses within a level cause the student to be
"moved up" to items at a higher level of difficulty; four wrong
responses within a level cause the student to be "moved down" to
items at a lower level of difficulty.

3) Testing terminates when the item pool is exhausted, or when the
next available item difficulty level has already been used, or when
four correct responses occur within the highest available level of
items, or when four wrong responses occur within the lowest
available level of items.
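
The three rules above can be sketched in a few lines. The sketch below is
written in Python for brevity and is not the DX Pascal source; the function
name, the data layout, and the reading of "item pool is exhausted" as
"the current level has no items left" are all assumptions.

```python
def run_session(pools, answer_fn, start_level=2, run_length=4):
    """Simulate one testing session under the three flow rules.

    pools      : dict mapping difficulty level (1 = easiest) to a list of items
    answer_fn  : callable taking an item and returning True if answered correctly
    Returns (log, reason), where log is a list of (level, item, correct).
    """
    level = start_level          # rule 1: begin at the middle level
    used_levels = set()
    log = []
    while True:
        used_levels.add(level)
        correct = wrong = 0
        move = 0
        for item in pools[level]:
            is_correct = bool(answer_fn(item))
            log.append((level, item, is_correct))
            if is_correct:
                correct += 1
            else:
                wrong += 1
            if correct == run_length:   # rule 2: four correct, move up
                move = 1
                break
            if wrong == run_length:     # rule 2: four wrong, move down
                move = -1
                break
        if move == 0:
            return log, "pool exhausted"            # rule 3 (assumed reading)
        next_level = level + move
        if next_level not in pools:
            reason = "ceiling reached" if move > 0 else "floor reached"
            return log, reason                      # rule 3
        if next_level in used_levels:
            return log, "next level already used"   # rule 3
        level = next_level


# Hypothetical 18-item pool with six items per level.
demo_pools = {
    1: ["e1", "e2", "e3", "e4", "e5", "e6"],
    2: ["m1", "m2", "m3", "m4", "m5", "m6"],
    3: ["h1", "h2", "h3", "h4", "h5", "h6"],
}
log, reason = run_session(demo_pools, lambda item: item.startswith("m"))
print(len(log), reason)
```

With three levels and a start in the middle, a session visits at most two
levels, because any further move would return to a level already used.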

The algorithms which encode these rules actually comprise only 10% of
the total Pascal code. The remainder is dedicated to screen management,
file management, and item sequencing. To the extent possible, the software
was designed to be simultaneously and without contradiction
user-friendly, robust, and generic. Within certain constraints any
properly designed diagnostic test could be delivered by this system.

Since the design of both tests included in this study allowed several
items to accompany a single item stem, the program must interpret
instructions which allow it to match the various pieces of text to appear
on screen together. For each item the computer must recognize a legitimate
keyboard response (for example, either "a" or "A" for the selection of
choice number one) and prompt for a retry if an invalid key is struck. It
must also keep count and exit from the retry command if the student insists
on hitting an invalid key repeatedly. The computer must be equipped to
properly file multiple attempts at the same exam by the same student, and
have fail-safe devices available for preserving response data if disk space
is full or power is lost. In technical terms, the software is firewalled.
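
The response-checking behavior described above can be sketched as follows.
This is an illustration in Python, not the original Pascal routine; the
retry limit of three and the function name are assumptions, and keystrokes
are passed in as a list so the logic can be exercised without a keyboard.

```python
VALID_KEYS = "abcde"  # five alternatives per item, upper or lower case


def read_response(keystrokes, max_retries=3):
    """Return the 0-based choice index, or None if the student gives up.

    keystrokes  : iterable of keys struck, in order
    max_retries : number of invalid keys tolerated before exiting (assumed)
    """
    invalid_count = 0
    for key in keystrokes:
        k = key.lower()
        # The length check guards against empty or multi-character input.
        if len(k) == 1 and k in VALID_KEYS:
            return VALID_KEYS.index(k)
        invalid_count += 1
        if invalid_count > max_retries:
            return None  # exit the retry loop after repeated invalid keys
    return None


print(read_response(["x", "B"]))  # one invalid key, then choice number two
```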

Additionally, Pascal is a language with a high degree of protocol
uniformity, which ensures that it can be transported between dissimilar
equipment with a minimum of software maintenance. The present software has
shown itself to be fully compatible with the Apple II+ and IIe, using two
floppy disk drives and an 80-column card. One drive contains the master
programs and Pascal operating system components, while the other holds the
test input files and preserves student response data. About two dozen sets
of test results can be filed per disk.

Test Administration

Sample. One hundred and sixteen students in grades 4 to 6 in seven
urban schools were included in this study. Teachers and administrators
were asked to select only those children whose reading levels were not above
grade level. Approximately half were of Hispanic background, 25% Black,
20% non-minority White, and the remainder Asian.

Procedure. Two 18-item computer-managed tests were administered to
each student individually in quiet school library settings, using an Apple
II+ or IIe. Instructions for use of the computer were made available by
staff members as necessary; instructions for the test itself were delivered
onscreen. The majority of these students had prior exposure to the Apple
computer as part of their math-science curricula. Total testing time
varied between 12 and 45 minutes per student, with a modal time of under 20
minutes. Classroom teachers were not present during the testing.
Administration/scoring protocols were run as an intact set, in which the
student was asked to provide his/her name and the answer to one simple
trial query, then asked to respond with the appropriate key upon reading
each question. Items on the Pronoun Test (Appendix A) occupy little of the
screen, and can be read rapidly; stems and items of the Reading
Comprehension Test (Appendix B) require most of a complete screen (80
columns x 24 lines) and must be read in detail. Screen window management
made it clear to the student that a new test item did not necessitate
rereading the item stem material; the stem was not erased until it was
no longer useful. For half of the sample, each response was immediately
followed by four seconds of feedback as to the correct answer, after which
all or part of the screen was erased as necessary. The other half of the
sample received no feedback at any time as to their progress.

RESULTS

The analysis of the data gathered during the field trials of the DX system
is divided into three parts. First we present summaries of student
performance. We then present data regarding the performance of the testing
software itself, and conclude with information about the diagnostic
interpretations generated by the software.

Summary: Students

Each student faced no fewer than eight items and no more than twelve
from either the Pronoun or Reading Comprehension tests, due to the
processing rules employed by the test administrator portion of the
software. At least four questions would be given from the middle level of
difficulty, and the remainder would be drawn from the higher level if
performance on the middle level warranted, otherwise from the lower level
of item difficulty. On the Pronoun test, 59% of all students moved to the
higher level of item difficulty; on the Comprehension test, only 36% of
students moved higher. While the modal pattern was movement in the same
direction on both tests, 31% of the students who moved up on the
Pronoun test moved down on the Reading Comprehension test, while 9% of the
students who moved down on the Pronoun test moved up on the Reading
Comprehension test.

Scores ranged from 0% to 100% correct in both tests, but only 2% of
students achieved perfect scores on both. Table 2 presents average
performance results by test level. The figures strongly suggest that there
is asymmetry between the two tests in terms of difficulty levels: while the
lower level items on the Pronoun Test are more often answered correctly
than the lower level items on the Reading Comprehension Test, the opposite
holds for the higher level items. However, we note here that such
disparities do not in themselves speak to the success or failure of this
approach to testing.

The current reading level for each student was requested from the
classroom teacher. By conventional measures, 17% of the students were said
to be functioning at or below grade 4, 44% at grade 5, and 39% at grade 6.
Reading grade level is significantly related to test performance, as shown
in Figure 1. If the student was in the lower reading grade level, his/her
performance on average was up to 20% worse than that of his/her
higher-level counterparts. This statement holds both for the upper and
lower levels of test difficulty. Only a small percentage of those lower
reading grade level students moved to the high levels of test difficulty.
At the same time, a larger proportion of that group scored no more than one
correct response in every three questions. The number of students moving
to the higher difficulty level of both tests is significantly related to
the reading grade level (F = 3.49, p < .05). Only 10% of those below the
5th grade level achieved this dual movement upwards; for those at the 5th
grade level the figure is 27%, and for those at the 6th grade level, 52%.

In both the Pronoun and Reading Comprehension tests, the distractor
patterns were revealing. The Pronoun test produced a uniform distribution
of errors: approximately one third of all errors were committed in each of
three distractor categories. Those students moving down committed more
Nominative pronoun errors, a simple type of error, while those who moved up
committed more Object-As-Preposition pronoun errors, a more complex error.
The Comprehension Test was not as balanced overall: errors of Main-Idea
were not as frequent as Literal errors, the simpler error type, with the
latter being made quite often by students who moved up. Table 3 shows the
distribution of errors by error type within test.

As part of the field trial design, about half the students were given
immediate feedback as to the correctness of their response, or the correct
answer if wrong. Those receiving feedback did no better or worse than
those not receiving feedback, but in many cases there was a palpable
difference in attitude about the experience afterwards. Students in the
feedback group were more likely to talk openly, and in some cases
excitedly, about the experience they had just completed. A high-scoring
sixth-grader offered to consult with the programmer: "Whoever designed
this did pretty good but he could use some of my help...!"

The first forty-two students who participated in this field trial were
asked to respond to a simple 15-item questionnaire to detail their views of
the testing experience. The overall impression from these responses is
that the computer-managed testing seemed both easy and interesting. None
indicated that the computer's instructions were confusing; few indicated
they would be more comfortable taking the tests using paper and pencil.
These students did not feel uncomfortable with computers in general, but
the statement "It is just as easy to take a test on a computer as it is to
take a paper-and-pencil test" elicited responses across the entire range,
from strongly agree to strongly disagree. On the whole, the average
respondent felt that the Pronoun and Reading Comprehension Tests were not
difficult, and that the experience had been fun.

The single most frequent complaint from students was about the clarity
of the computer screen display of the Apple computer. Standard
school-owned equipment of various vintages and uneven degrees of focus was
used in the field trials. This lack of screen clarity, which is not under
software control, appeared to be a strain on several students and may have
led to occasional inadvertent errors.

Summary: Testing Software

System performance was generally very good to excellent during this
field trial. Perhaps the most dramatic failure occurred when a teacher
came through the door looking for a spare computer power cord, and without
thinking managed to "pull the plug" on a testing session in progress.
Typing one's name and answering a mock test question proved to have a half
dozen nontrivial variants. In other respects, most faults that could be
predicted did indeed occur at least once, but the system included
sufficient firewalling to allow testing to continue. Firewalling comprises
those portions of software programming which enable fault tolerance and
system recovery under adverse conditions. For example, duplicate names and
repeat sessions by the same person are assigned unique directory names.
Fully one-sixth of the programming code is occupied by firewalling;
happily, it generally lived up to expectations.
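
One piece of such firewalling, the assignment of unique directory names to
duplicate names and repeat sessions, might look like the sketch below. The
numeric-suffix scheme and the function name are hypothetical; the text does
not say how DX actually formed the names.

```python
def unique_name(base, existing):
    """Return a name not already in `existing`, adding a numeric suffix if needed.

    base     : the name typed by the student
    existing : set of directory names already on the data disk
    """
    if base not in existing:
        return base
    suffix = 2
    while f"{base}_{suffix}" in existing:
        suffix += 1
    return f"{base}_{suffix}"


# Hypothetical data disk that already holds one session for "maria".
print(unique_name("maria", {"maria"}))
```

Keeping every session under its own name means a repeat attempt never
overwrites earlier response data.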

The system managed events in the testing session in several respects:
it presented materials, captured and checked responses, and moved the
student up or down from the middle range of item difficulty. After some
starting problems were resolved, presentation of materials and checking of
responses proceeded flawlessly; moving up or down must be evaluated
separately. In a finely-tuned testing system, such movement would be the
result of complex probabilistic, heuristic, and/or logic-based real-time
interpretations of student performance. The present system used a small
packet of rules regarding response tallies, separately by test and by item
difficulty level within test. Table 4 shows the results of the system's
decisions with regard to sending a student to a more difficult or less
difficult set of items. Test paths regarded as "correct" are those in
which the number of errors made at the higher level is not less than the
number made below.
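
The definition of a "correct" test path can be written out as a short sketch. This is an illustrative Python rendering of the stated rule only, not the original software; the function name and argument layout are assumptions, and the "unclear" category reported in Table 4 is not modeled because the text does not define it.

```python
def classify_path(moved, errors_middle, errors_other):
    """Classify a test path using the rule stated in the text.

    moved         -- "up" or "down" (direction taken from the middle level)
    errors_middle -- number of errors on the middle-difficulty items
    errors_other  -- number of errors on the level the student was sent to
    """
    if moved not in ("up", "down"):
        raise ValueError("moved must be 'up' or 'down'")
    if moved == "up":
        higher, lower = errors_other, errors_middle
    else:
        higher, lower = errors_middle, errors_other
    # "Correct": errors at the higher level are not fewer than those below.
    return "correct" if higher >= lower else "misfit"


# A student who does moderately well in the middle (2 errors), is moved
# down, and then misses all 6 easier items shows an apparent misfit.
example = classify_path("down", errors_middle=2, errors_other=6)
```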

An example of an incorrect path is one in which a student scores
moderately well on the middle level but fails every item on the lower level
of difficulty. Unfortunately this illogical response pattern is entirely
plausible, and indeed was present in 5% of the Pronoun Test responses, and
18% of the Reading Comprehension Test responses. The instances of apparent
misfits were evenly divided between those students moving up and moving
down. Fully 30% of students at the 4th grade reading level or less were
apparently misfitted. Part of this may be due to random or nearly random
responding by these students. However, during these field trials few
students at any level were seen to hit the identical response key
repeatedly and all appeared to be actively engaged in the testing task.

Summary: Diagnostic Interpretations

A relatively small number of error types were included in the test
administered during the field trials. The plurality of errors on the
Pronoun Test derived from confusion about use of the pronoun /whom/. One
Pronoun Test item showing this confusion had two distractors which each
received more responses than the correct answer.

Several popular errors in the Reading Comprehension Test involved a
distractor based on a key word found in the first line of the test
passage. One distractor in each of four Comprehension Test items received
at least as many responses as the correct answer.

In the context of the Pronoun Test, the distribution of answers
allotted to each answer choice is highly uneven. One item (#18, the last of
the lower difficulty items) was answered correctly by 88% of those who got
to it. In contrast, one item (#2, the second item in the upper difficulty
set) was answered correctly by only 26% of the students. In the Reading
Comprehension Test, items ranged from 94% correct (#6, the last of the
upper difficulty items) to 35% correct (#18, the last of the lower
difficulty set).

These figures are at odds with the values found during the pilot
studies, suggesting one of two conclusions. First, the fact of
computerizing the tests may cause item level performance to be at variance
with paper-and-pencil performance by virtue of some characteristic of the
videoscreen or keyboard. Certain longer text items tended to fill the
entire screen, and in some cases that resulted in pincushion or marginal
distortions and therefore a reduction in clarity. Alternatively, the test
items themselves may not be sufficiently robust in application to fifth and
sixth grade pupils, computerization aside.

DISCUSSION

Diagnostic Interpretation 

In practice, students were wholly unaware of the underpinnings of the
testing, and knew only that each test item was followed in order by another.
Students did not seem to be concerned that testing times were often very
different, or that what might appear on their neighbor's video screen would
not appear on their own. In the present study, the degree to which the
test was optimized for each student is shown by the ratio of successful to
unsuccessful test paths (using the definition provided earlier). For the
Pronoun Test, the simpler of the two diagnostic tests, the ratio is good.
For the Comprehension Test, which by definition involves substantially
more ambiguity in both the nature of the test stimuli and the diagnostic
interpretability of competing responses, the ratio is somewhat reduced.
Obviously, a study using a less constrained set of test materials and a
larger number of diagnostic indicators, while conceptually more accurate,
is not likely to experience the same rates of successful test path
management. The distinctiveness of "successful" vs. "unsuccessful" test
paths in that study would be reduced unless all other variables are held
equal. If the test domain itself is made less ambiguous, or if the
strength of the diagnostic indicators is improved, improvements in test
path success are possible though by no means certain.

In contrast to ad seriatim testing, the controlled environment of a 
computer-managed testing session theoretically allows optimizing of the 
test to the emerging "picture" of the proficiency of the student. Several 
students asked to see their total scores, and, while disappointed to 
discover that total correct score was probably misleading, took an interest 
in the diagnostic particulars. Some then volunteered that they indeed 
recognized one or another error as a tendency in their classroom work.

The constraints to diagnostic interpretation are numerous and varied.
In most extant work with computerized diagnosis, the system integrates a
sizeable database of known diagnostic indicators, and constantly refers to
that database as part of its deductive strategy. The present formulation
relies instead on an abductive approach, which in turn relies on the
strength of each test item and its relation to the targets of concern, and
the probative validity of each response in terms of a given diagnostic
hypothesis. Any item, no matter how well-formed and well-validated in
pilot studies, may nonetheless elicit a unique response from an examinee.
Subsequent diagnostic inferencing which fails to see that uniqueness will
suffer accordingly. In the present study, unfortunately, we have no
scientific means of making such assessment.

Student Behaviors

From the behaviors of students involved in this study, it is clear
that most upper primary pupils not only have some familiarity with
microcomputers, but that their expectations of software are very high.
This is no doubt due to the visual sophistication of most video arcade
machines. Students waiting at a computer before a test session frequently
made some effort to run BASIC or LOGO, only to be rebuffed by the
unfamiliar p-system operating environment. Some of the brighter students
appeared mildly frustrated that the actual requirements of the task were
simplistic even if the test items were tough: only single keystrokes were
needed to make one's response. Often, early in testing, the brighter
examinees would begin to type their answers directly, despite instructions
on screen about hitting one key only. In sum, the visual excitement of a
text-based testing system is low.

Limitations

In the present study three distinct lessons were learned regarding the
limitations of diagnostic testing using computers. First, it is essential
that all facets of the testing software be firewalled, for the weakest
point in the software defines the system's weakness in actual operation.
Unexpected keyboard entries and carriage returns, disk-swapping, directory
errors, and other events even with low intrinsic probability are plausible
and must be protected against.
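
As one concrete case of such protection, the sketch below screens a raw keyboard entry before it is accepted as an answer. It is a hypothetical Python illustration, not the original code; the five answer letters follow the Pronoun Test format in Appendix A, and the function name and the choice to re-prompt on invalid input are assumptions.

```python
VALID_KEYS = "ABCDE"  # the Pronoun Test items in Appendix A offer choices a-e


def read_choice(raw, valid=VALID_KEYS):
    """Return the chosen letter, or None if the entry should be re-prompted."""
    # strip() drops stray spaces and carriage returns around the keystroke.
    cleaned = raw.strip().upper()
    # Accept exactly one character, and only if it names an answer choice.
    if len(cleaned) == 1 and cleaned in valid:
        return cleaned
    return None


accepted = read_choice("b\r")   # single key followed by a carriage return
rejected = read_choice("whom")  # examinee typed the answer word directly
```

Returning `None` rather than raising keeps the session alive: the calling loop can simply show the item again, which matches the goal of letting testing continue after an unexpected entry.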

Second, the number of test items needed to sufficiently explore
competing diagnostic hypotheses is an exponential rather than linear
function. A stream of correct responses interrupted solely by errors of a
single kind is unlikely. Far more prevalent are response patterns which
are due to one or more of the following: random guessing, inadvertent
keystrokes, partial elimination of distractors, competing but nonexclusive
diagnostic behaviors, and behaviors which are not strongly defined by any
of the working pool of diagnoses under consideration.

Third, the nature of computerized management of testing in itself does
not add a guarantee about the adequacy and appropriateness of the
concluding diagnostic interpretation. Rather, it allows a more efficient
route to that conclusion, which then must be appraised in the appropriate
context of the theory or theories of reading behavior underlying the test
itself.

The rule structure which governs event management of a testing session
is critical to the utility of computer-aided diagnostic testing. In the
present system, the decision to keep the rule structure to three levels of
item difficulty (a functional minimum) was made both because the structure
of the tests to be administered was strictly limited, and because adding
complexity would not have added information for research purposes. Though
the rules utilized in this field study make relatively few demands on the
test designer and the programmer, they do assume that the examinee will
answer items in a stable manner: that is, answers will be consistent,
items need not be repeated, one wrong response contains the same
information value as another, and performance will not change materially
during the testing session even though the student may come to feel
familiar, bored or frustrated with the task.
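
A three-level rule structure of the kind described can be sketched in a few lines. The Python below is illustrative only: the text says the rules were based on response tallies but does not give the cut-off, so the `pass_mark` of 4 correct out of the middle-level items is a hypothetical value, as are the function and level names.

```python
def next_level(middle_responses, pass_mark=4):
    """Decide where to send an examinee after the middle-difficulty items.

    middle_responses -- list of booleans, True for each correct answer
    pass_mark        -- hypothetical tally needed to move up (assumption)
    """
    correct = sum(1 for r in middle_responses if r)
    # Every wrong answer counts the same, matching the stability
    # assumptions listed in the text.
    return "upper" if correct >= pass_mark else "lower"


strong = next_level([True, True, False, True, True, True])
weak = next_level([False, True, False, False, True, False])
```

Because the rule looks only at a tally, it cannot distinguish a student who tires toward the end of the set from one who guesses throughout, which is the limitation the paragraph above describes.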

In theory, each planned increase in rule structuring simultaneously
should show an increasingly well-tempered fit to diagnostic realities, and
should enable a more flexible and adaptable system overall. One needed
improvement is to build a system which selects the sequence of items with
direct reference to the nature of the wrong answers given; answer A leads
to items which specifically probe that response in detail. The granularity
of diagnostic information derived from such detailed structuring is finer
than before, though it risks cutting the analytic knife more closely than
can be supported by theory. It must be noted, too, that rules which
engender any ambiguity whatsoever in their real-time execution will cause
significant problems for the software. What route should be formulated for
the examinee whose performance improves or deteriorates over the course of
the testing session? Should a student who initially does badly always be
relegated to a lower level of item difficulty? Alternatively, how much
complexity in the specification of branching is likely to be applicable to
the modal case? As is frequently true in real-time software design, major
amounts of programming logic are required for minor amounts of operational
enhancement, and/or for plausible but rare events.
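
The proposed improvement, in which a particular wrong answer leads to items that probe that response, might look like the following sketch. Everything here is hypothetical: the item identifiers, the follow-up table, and the fallback rule are invented for illustration and do not describe the field-trial software.

```python
# Hypothetical table: (item, wrong answer chosen) -> items that probe it.
FOLLOW_UPS = {
    ("P01", "a"): ["P07", "P08"],  # e.g. chose "who" where "whom" was keyed
    ("P01", "b"): ["P11"],
}
# Hypothetical default sequence used when no specific probe is defined.
DEFAULT_NEXT = {"P01": "P02"}


def next_items(item_id, answer, correct_answer):
    """Pick the next item(s) from the examinee's answer to `item_id`."""
    if answer == correct_answer:
        return [DEFAULT_NEXT[item_id]]
    # An unambiguous fallback avoids the real-time ambiguity the text
    # warns about: unmapped wrong answers follow the default sequence.
    return FOLLOW_UPS.get((item_id, answer), [DEFAULT_NEXT[item_id]])


probe = next_items("P01", "a", correct_answer="e")
plain = next_items("P01", "e", correct_answer="e")
```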

Another issue which cannot be lightly dismissed is found in the
heated, highly technical controversy about approximate and plausible
reasoning, and/or endorsement or belief-based logic as opposed to
conventional syllogistic or Bayesian probabilistic reasoning. In many
respects, the multiple uncertainties surrounding examinee performance may
be more appropriately modeled by fuzzy truth values and fuzzy proximities
("A is very much like B"; "C is fairly close to D") than by exact logics
("A is equivalent to B"; "C does not equal D"). These complex approaches
to the heuristics of uncertainty, however well-suited in theory to
diagnostic analysis of performance, are neither tractable for small samples
nor easy to corroborate within the scope of educational testing. It is a
case in which the available computational power potentially exceeds our
present ability to use it well. One is forced to take a relatively
conservative stance in the present circumstances, and use only such
complexity as is warranted by the immediate task of providing rough
diagnostic indicators.
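
The contrast between exact and fuzzy comparison can be made concrete with a small sketch. It uses the standard min/max/complement operators of fuzzy logic; the `proximity` measure (share of matching answers) and the answer strings are assumptions chosen for illustration, not a method used in the study.

```python
def fuzzy_and(a, b):
    """Degree to which both statements hold (standard min operator)."""
    return min(a, b)


def fuzzy_or(a, b):
    """Degree to which at least one statement holds (max operator)."""
    return max(a, b)


def fuzzy_not(a):
    """Degree to which a statement does not hold."""
    return 1.0 - a


def proximity(observed, expected):
    """Share of positions at which two answer strings agree (0.0 to 1.0)."""
    if len(observed) != len(expected) or not expected:
        raise ValueError("answer strings must be non-empty and equal in length")
    matches = sum(o == e for o, e in zip(observed, expected))
    return matches / len(expected)


observed = "aeaeab"        # hypothetical answers to six items
error_pattern = "aaaaab"   # hypothetical signature of one diagnosed error
exact = observed == error_pattern           # exact logic: equal or not
close = proximity(observed, error_pattern)  # fuzzy: "fairly close"
```

Exact comparison reports only that the two patterns differ, while the proximity value records that they agree on four of six items, which is the kind of graded statement the paragraph above has in mind.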

Automaticity / Decoding
    Studies: Calfee & Piontkowski, 1981; Lesgold & Resnick, 1982;
        Fleisher, Jenkins, & Pany, 1979
    General Trends/Issues: Analyzed 1st graders' acquisition of decoding;
        causal links between automation and comprehension. Training in
        single word decoding improves comprehension.

Sight Word Recognition
    Studies: *Drum, Calfee, & Cook, 1981; Perfetti & Hogaboam, 1975
    General Trends/Issues:
        - Examine effects of surface structure on comprehension.
        - Relationship between single word decoding and comprehension:
          skilled readers more rapid. However, comprehension differences
          not clearly controlled.

Word Meaning (Vocabulary)
    Studies: Beck, Perfetti, & McKeown
    General Trends/Issues: Relationship between rapid word access and
        comp: Better comprehenders more accurate though not necessarily
        more rapid.

Processing / Sentence Level / Syntax
    Studies: Marshall & Glock, 1978-79; Trabasso, 1980; *Barnitz, 1980
    General Trends/Issues:
        - Logical relationships are signaled by connectors. Paragraphs
          with short sentences may be harder to comprehend than those
          with longer but more connected sentences.
        - Lexical ambiguity, pronominal references, sentence context and
          comprehension.
        - Effect of referents in 4 syntactic functions on comprehension,
          grades 2, 4, 6.

* Indicates studies with information which is useful for test development,
e.g., sample items or domain specifications.

Theoretical and Historical

Background Knowledge / Schema
    Studies: *Beck, Omanson, McKeown, Graves, & Cooke, 1981
    General Trends/Issues: Knowledge-expanding lesson on general content
        of a text (before subjects read passages) increases their
        comprehension.

Content Knowledge & World View
    Studies: Ausubel, 1963; *Mosenthal, 1979; Pearson, Hansen, & Gordon,
        1979; see also: Langer & Nicholich, 1970; Graves & Cooke, 1981;
        Anderson, 1977; Lee & Allen, 1963; Stauffer, 1970
    General Trends/Issues:
        - Advance organizers improve comprehension.
        - High vs. low content knowledge and comprehension: high content
          knowledge increases comprehension.
        - Prereading activities aimed at increasing concept knowledge
          increase comprehension of low ability upper elementary and
          junior high students.
        - Using experience improves comprehension.

Note - Semantic and conceptual mapping are techniques used to discover or
enrich subjects' background knowledge throughout these studies.

Text Structure: Macro (Story Grammar; Passage Length; Other, e.g.,
Essay Grammar)
    Studies: *Collins, Brown, Larkin, 1979; Newsome & Gaite, 1971
    General Trends/Issues:
        - People create & revise hypotheses as they read. There are
          micro as well as macro cues which signal hypotheses changes.
        - Short passages are recalled for a longer period of time,
          regardless of background information on the passage - see
          limitations.

Text Structure: Micro (Propositions)
    Studies: Kintsch & Keenan, 1973; *Kintsch, Kozminsky, et al., 1975;
        Baker, 1979
    General Trends/Issues:
        - Texts with greater number of propositions more difficult to
          comprehend. Texts with logically inconsistent propositions
          harder to comprehend.
        - A richly detailed text which develops only a few concepts is
          more comprehensible than text which develops several concepts
          - regardless of passage length. Propositions found in
          superordinate propositions will be recalled more easily than
          subordinate propositions. Limitation - no relationships drawn
          to referents.

Social Interaction / Reader's Perceived Purpose for Reading
    Studies: Gallimore et al., 1982
    General Trends/Issues: Classroom context/culture and reading:
        Relationship of reader's home learning and communication style
        to classroom communication and learning with reference to
        reading. Includes references to building schema, and story
        grammars.

Prior Instructions/Questions given to readers
    Studies: Mosenthal, 1983; Langer & Nicholich, 1980
    General Trends/Issues:
        - Metamemory, social context and comprehension.
        - Effects of testing and prereading activity to help student
          access relevant information. Prereading activities improved
          comprehension for average readers only. No effect for low and
          high comprehenders.

Motivation & affect

Metacognitive Skills: Awareness of non-comprehension; Strategies for
coping with Comprehension Difficulties
    Studies: Baker, 1979; Markman, 1977
    General Trends/Issues: Readers use several strategies to cope (good &
        bad readers have developmental differences in their coping
        strategies). Bad readers stick to bottom up relevant strategies
        (i.e., decoding, sentence level syntax). Good readers use a
        variety of strategies including top down (i.e., conceptual
        mapping, drawing analogies).
    Studies: Clay, 1972; Olshavsky, 1976-77
    General Trends/Issues:
        - I.D. 1st graders' awareness of coping strategies.
        - I.D. 10 strategies used by high school subjects for tackling
          difficult passages: examples: 1) concentrating on decoding and
          sentence level comprehension; using sentence level, paragraph
          level context for meaning; rereading beginning and reading end
          of passage to understand something in the middle.

General Strategies for approaching text
    Studies: *Fredericksen, 1983; Anderson et al (T); Collins, Brown &
        Larkin
    General Trends/Issues:
        - Reading as problem solving (see also Baird, 1982). Passages
          can pose ill- or well-structured processes.
        - Identified 12 variables that may affect performance in finding
          the main point. Manipulated 4 variables: fit of passage to
          strategy; how main point stated; amount of irrelevant
          information; frequency of supportive statements.
        - Identified 8 problem solving strategies for approaching text.
          These strategies combine micro and macro structure elements.
        - Overlap in strategies identified above.

Testing
    Studies: Drum, Calfee & Cook, 1981; Haertel & Calfee, 1983;
        Bauman, J.; Anderson et al; Royer & Cunningham
    General Trends/Issues:
        - Effects of surface structure variables on performance.
        - Features for describing differences among reading achievement
          tests.
        - Linguistic structure and validity of reading comprehension
          tests.
        - Domain-referenced approach to testing comprehension.
        - Most R.C. tests test world knowledge rather than comprehension.

Hierarchies
    Studies: Davis, 1968, 72; Spearritt, 1972; Thorndike, 1973;
        John Carroll, 1976
    General Trends/Issues:
        - Identified 5 skills; 4 factors of reading comp.
        - Reanalyzed Davis' data. Spearritt found 4 skills, 3 factors.
        - Thorndike - 3 factors, also a reanalysis of Davis.
        - Identifies 8 components of reading comp - theoretical only.

    Example of factors found throughout studies:
        - recall of word meaning
        - using context for meaning
        - inferences from content
        - recognizing author's purpose, attitude, tone
        - following structure (sequence) of passage

Table 2

Average Performance by Test Level

                          Pronoun             Comprehension
Test level             % correct     n      % correct     n

higher difficulty        49.5        69       71.8        42
middle difficulty        63.5       116       59.0       116
lower difficulty         77.8        47       60.2        74

Table 3

Errors by Error Type

PRONOUN:
                                       Direct      Object of
                       Nominative      Object      Preposition
% of total errors:
   overall                31%           35%           34%
   if S moved up          30%           34%           36%
   if S moved down        38%           34%           29%

COMPREHENSION:
                       Main Idea     Inference      Literal
% of total errors:
   overall                28%           36%           39%
   if S moved up          15%           30%           55%
   if S moved down        30%           37%           34%

Table 4

Correct vs. Incorrect Test Paths

                                              PRONOUN   COMPREHENSION

"Correctly" moved examinee up or down           89%         70%
Apparent misfitted path for examinee             5%         18%
Unclear for results gathered                     6%         12%

Ratio:
Moved up : Moved down within test               9:5         2:5

Moved examinee down on pronoun but up on comprehension       9%
Moved examinee up on pronoun but down on comprehension      31%

Figure 1

Reading grade level and test performance

[Figure not reproduced. Plotted groups: Pronoun Test, examinees who moved
to higher level items; Reading Comprehension Test, examinees who moved to
higher level items; Reading Comprehension Test, examinees who moved to
lower level items; Pronoun Test, examinees who moved to lower level
items. Horizontal axis: reading grade level.]

APPENDIX A



The sentences below have a missing word.
Please select the appropriate pronoun to
fill in the missing words. Be sure to
keep the meaning of the sentence the same.

The politician cleared his throat and looked solemnly at his audience. He
began his speech by saying, "I want to express my appreciation to the
volunteers, to ____ I will always be grateful."

a. who 

b. they

c. which 

d. her 

e. whom 

The professor was listening carefully to the famous astronaut. The
astronaut, ____ the professor admired, had some interesting theories about
the effects of space on the body.

a. whom 

b. he 

c. she 

d. who 

e. which 

The campers were awakening to a crisp morning. John, ____ was usually the
first one to get up, was snoring loudly.

a. she 

b. they 

c. which 

d. whom 

e. who 

The actress ran to the stage and grabbed her award. She turned to the
audience and said, "I want to dedicate this award to my parents, to ____
I will be eternally grateful."

a. who 

b. they 

c. him 

d. which 

e. whom 

The children, ____ were excited about the spelling bee, were also looking
forward to their surprise. Mr. Hodges had provided cookies and punch as a
reward for their hard work.

a. them 

b. which 

c. who 

d. whom 

e. him 

The boys were waiting for Mr. Jones, the vice-principal. The boys, ____
Mr. Jones had warned several times, were nervous about seeing him again.

a. who 

b. which 

c. whom 

d. they 

e. she



The sentences below have a missing word.
Please select the appropriate pronoun to
fill in the missing words. Be sure to
keep the meaning of the sentence the same.

Jenny and her sister saw Tom, their brother, outside while they were
cleaning the house. "Tom!" exclaimed Jenny, "Please come and help ____
girls clean up."

a. them

b. she 

c. we 

d. us 

e. their 

Mike and Larry were vacationing in London when they spotted the lead singer
of a rock group, Clash. As they ran toward him, the singer ran away from
____ and disappeared.

a. them 

b. their 

c. us 

d. him 

e. they 

Our team won the game by 10 points. However, ____ wanted to be good
sportsmen, so we treated the other team to a pizza lunch.

a. I 

b. he 

c. we 

d. they 

e. them 

Jim and I won the first place prize for our art. ____ wanted to celebrate
our victory, so we all went out for soft drinks.

a. I 

b. he 

c. we 

d. they 

e. them 

Sparkles was panting nervously as her owners tried to calm ____ down. She
was about to have puppies.

a. him 

b. she 

c. her 

d. them 

e. he 

The boss, with ____ Jan had to work, had a terrible temper.

a. he 

b. him 

c. whom 

d. who 

e. which

The sentences below have an underlined word or words.
Please select the appropriate pronoun to replace
the underlined words. Be sure to
keep the meaning of the sentence the same.

Mom praised MARY AND STEVIE.

a. them 

b. they 

c. us 

d. she 

e. him 

The baby threw the ball to SARA.

a. her 

b. him

c. she 

d. his 

e. them 

JOHN AND MARY went sailing. 

a. we 

b. their 

c. them 

d. he 

e. they 

The horse galloped toward JOHN AND ME as we jumped over the fence. 

a. them 

b. us 

c. him 

d. we 

e. theirs 

MARY AND I went shopping together. 

a. he 

b. we 

c. they 

d. I 

e. us 

The police made THE BOY go home. 

a. he 

b. her 

c. them 

d. his 

e. him

APPENDIX B

There are many active volcanoes on earth such as the famous Mauna Loa
in Hawaii. They occasionally shoot smoke and lava onto the earth's
surface. The formation of a volcano can be thought of as an underground
balloon, buried under thin layers of sand and plaster. Hot magma fills the
balloon and pushes up the ground. Magma is the hot material that forms the
earth's center core. The magma pushes upward inside the volcano and melts
the surrounding rock and dirt. The melted rock and dirt is called lava.
Eventually, the lava will push through the top of the mountain. This
action causes a volcanic explosion and forms a crater. After the
explosion, or eruption, the lava cools off and the mountain sides shrink
somewhat. This is similar to a balloon shrinking after it pops. A volcano
is said to be inactive when it no longer produces eruptions.

This passage is mostly about how....

a. volcanoes are filled with lava. 

b. mountains are sometimes volcanoes. 

c. volcanoes are created. 

d. magma creates mountains. 

An active volcano... 

a. erupts every other month. 

b. erupts occasionally. 

c. never erupts. 

d. contains cold lava. 

According to the passage, when a balloon pops, its sides...

a. shrink. 

b. become larger. 

c. crack. 

d. become rounder. 

Butterflies are created through an interesting series of stages.
Adult butterflies lay their eggs on leaves or tree branches. These eggs
turn into caterpillars. The caterpillars crawl around the trees and
bushes, feeding on leaves. Eventually, the caterpillars weave themselves a
silken shell called a cocoon. The caterpillars sleep inside the cocoon for
a few weeks. Inside the cocoon, the caterpillar grows its butterfly wings,
antennae and body. When it awakens, the butterfly breaks out of its cocoon
and flies away. When it lays eggs, the process will begin again.

The passage is mostly about how....

a. adult butterflies are made. 

b. caterpillars find food. 

c. cocoons are created. 

d. butterfly eggs are hatched. 

According to the passage, how does the caterpillar use the cocoon? 

a. The caterpillar stores its food in the cocoon.

b. The caterpillar eats the cocoon while it changes into a butterfly.

c. The caterpillar sleeps in the cocoon while it changes into a
butterfly.

d. The caterpillar lays eggs in the cocoon where they hatch.

What do caterpillars eat? 

a. eggs. 

b. silk. 

c. leaves. 

d. branches. 

A crow who had stolen a piece of cheese was flying toward a tall
tree. She was going to eat her prize, when a hungry fox saw her. "If I
plan this right," thought the Fox, "I'll have cheese for supper."

So as he sat under the tree, Fox falsely spoke in the most polite
tones, "Hello, Miss Crow, how pretty you look today!" he lied. "Your
wings are so broad and your feathers are so black. You hunt so well. I
haven't heard your voice but I'm sure that it's finer than that of any
other bird."

Crow was so flattered that she waggled her tail and flapped her wings
to show her pleasure. She really liked what Fox said about her voice
because she had been told that her caw was a bit rusty. So, laughing
inwardly, she decided to surprise the Fox. She opened her mouth to sing
with her beautiful voice. Down dropped the cheese and the Fox snatched
it. Licking his chops Fox said to the Crow, "Next time someone praises
your beauty be sure to hold your tongue!"

What did Fox say was pretty about Crow? 

a. Her body, voice, and hunting ability. 

b. Her tail.

c. Her feathers. 

d. Her cheese, feathers, and singing ability. 

What was Crow's surprise for the Fox? 

a. Her cheese. 

b. Her voice. 

c. Her beauty. 

d. Her wings. 

This story is mostly about how...

a. the crow and the fox made friends. 

b. the fox outsmarted the crow. 

c. the fox lied. 

d. the crow dropped the cheese. 

How would you like to know how to make a steam engine in your
kitchen? For your safety, be sure to try this experiment with an adult
around. First, make a pinwheel with some paper and a thin stick. Cut the
paper into a circle, or into two propellers. Now push a pin through the
middle of the circle or propellers and pin it to the top of the stick.

Now put the stick on a clothespin. Next, get a teapot. Then put the
clothespin on the teapot's spout. Now fill the teapot with water and put
it on the stove. Let the water come to a boil. Now watch the steam make
the pinwheel spin. The stream of steam coming out of the spout pushes the
pinwheel and makes it spin. Steam is hot moist air. All hot air travels
upward. Now turn off the teapot and watch the pinwheel come to a stop as
the steam is no longer coming out.



This passage is mostly about...

a. how to make a steam engine from a teapot. 

b. how to use steam carefully. 

c. how to put together a pinwheel.

d. how to use the stove safely. 

In the passage, a pinwheel is a:

a. wheel of sticks held together by paper and a clothespin. 

b. firework fixed on a stick which turns around when lit by a match. 

c. lollipop which is the shape of a circle and has a spiral design on it.

d. wheel of paper attached by a pin to a stick so that it will turn
around.

Why will the pinwheel stop when the teapot is turned off?

a. The steam is coming out.

b. The pinwheel is too soggy.

c. The steam is no longer coming out.

d. The pinwheel no longer works.

One day all the trees and flowers were gray. The great leader looked
around and said, "This is dreary! Who wants to paint all the flowers and
trees for me?" Peacock volunteered to do it. In those days, he was ugly
and had short tail feathers, so he felt very bad about himself.

The great leader told Peacock how to do this job. Peacock sadly said,
"I'm so ugly. I don't know if I can do this job. What do I know about
beauty?" But the great leader made Peacock do it.

Peacock flew all over the earth. He collected all kinds of colors and
used his tail feathers to paint. At the end of the day, Peacock's
tail feathers were falling out.

"You have done a wonderful job," said the great leader. "It shows the
beauty inside of you. I will give you long colorful tail feathers to remind
you of this."

This story is mostly about...

a. why the Peacock was ugly.

b. how the Peacock got its tail feathers.

c. why the flowers were gray.

d. how the Peacock lost its tail feathers.

Why did the great leader want the flowers and trees painted? 

a. Everything was purple and he thought this looked funny. 

b. Peacock collected all kinds of colors. 

c. Peacock volunteered to paint them. 

d. Everything was gray and this made everything look drab.

Why did the great leader give the Peacock pretty tail feathers?

a. To remind Peacock of his achievement. 

b. Because Peacock's feathers were short. 

c. To punish Peacock for his poor job. 

d. Because Peacock was gray and ugly.

Along the shores of a lake lived a man named Parsee. He loved to eat
cake. One day he made himself a giant cake three feet thick. Just as
Parsee was going to eat the cake, along came a rhinoceros. In those days,
Rhino had very smooth, tight fitting skin.

Rhino saw the cake, spiked it with his horn, and ate most of it.
Parsee was very angry. Rhino was rude, he should have asked for a piece of
cake.

A few weeks later there was a heatwave. It was so hot that everyone
took off their skin and went swimming. Parsee found Rhino's skin and
filled it with dried cake crumbs. He was going to teach Rhino to be
polite. The crumbs would remind Rhino of what he had done.

When Rhino put his skin back on, the crumbs made him itch. The more
he scratched the worse it itched. His skin has been baggy ever since.

Why was Parsee angry at Rhino?

a. Because Parsee loved to eat cake. 

b. Because Rhino was rude to steal the cake. 

c. Because Rhino left his skin by the cake. 

d. Because Parsee felt irritable in the heat.

What did Parsee want to teach Rhino? 

a. To be neat. 

b. To be polite. 

c. To feel itchy.

d. To eat less. 

This story is mostly about...

a. how much Rhinos love cake. 

b. how the Rhino got baggy skin. 

c. how Parsee got hungry. 

d. how animals used to cool off. 